Sentiment Analysis and Product Recommendation¶

In today's data-driven world, unstructured text data is becoming increasingly valuable. From customer reviews and feedback to social media posts and emails, organizations have access to a wealth of unstructured data that can provide insights into customer behavior and sentiment. However, analyzing and making sense of this data can be a daunting task, requiring a combination of natural language processing (NLP) techniques and machine learning algorithms.

In this article, we will walk through a complete data science project for sentiment analysis and product recommendation using Python. We will cover key skills and technologies, including text preprocessing, feature extraction, machine learning, and model selection and evaluation.

The Problem: Sentiment Analysis and Product Recommendation¶

Our goal is to build a model that can analyze customer feedback from various channels, such as social media, email, and support tickets, and categorize the feedback into positive, negative, or neutral sentiment. We will use Python and NLP libraries, such as NLTK and spaCy, to preprocess the text and extract features. We will then train a sentiment classification model, such as Naïve Bayes or BERT, to perform the sentiment analysis.

We will then use the sentiment analysis results to enhance a product recommendation engine. By combining customer sentiment data with purchase history and browsing behavior, we can build a collaborative filtering or content-based recommendation algorithm to suggest relevant products to customers.

The Data: Customer Feedback¶

To demonstrate the sentiment analysis and product recommendation project, we will use a dataset of customer feedback. The dataset contains the following columns:

  • target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
  • ids: the id of the tweet (e.g., 2087)
  • date: the date of the tweet (e.g., Sat May 16 23:58:44 UTC 2009)
  • flag: the query used to collect the tweet (e.g., lyx); if there is no query, this value is NO_QUERY
  • user: the user that tweeted
  • text: the text of the tweet

The text column will be used as input to our sentiment analysis model, while the user, ids, and target columns will be used for our product recommendation model.

Step 1: Text Preprocessing¶

The first step in our sentiment analysis and product recommendation project is to preprocess the text data. This involves a number of steps, including removing URLs and user mentions, removing special characters and digits, converting to lowercase, tokenizing the text, removing stop words, and lemmatizing the words.

In natural language processing, tokenization is the process of breaking up a text into smaller units called tokens. These tokens could be words, phrases, or even individual characters. The tokenization process is a crucial step in many natural language processing tasks, including sentiment analysis, machine translation, and named entity recognition.

The tokenization process involves several steps:

  1. Splitting the text into words: The first step in tokenization is to split the text into words. This is usually done using whitespace or punctuation marks as delimiters.
  2. Removing punctuation and special characters: Once the text has been split into words, any punctuation or special characters are removed.
  3. Converting to lowercase: The tokens are then converted to lowercase to ensure that words with different capitalization are not treated as separate tokens.
  4. Removing stop words: Stop words are words that are commonly used in a language but do not carry much meaning, such as "a", "the", and "is". These words are removed from the token list as they are unlikely to contribute to the meaning of the text.

After tokenization, the next step is often lemmatization. Lemmatization is the process of reducing words to their base or root form. For example, the word "running" can be reduced to its base form "run". This is important because it helps to reduce the dimensionality of the feature space, which can make natural language processing tasks more computationally efficient.

The lemmatization process involves several steps:

  1. Part of speech tagging: The first step in lemmatization is to identify the part of speech of each token in the text. This is done using a part of speech tagger.
  2. Lemmatization: Once the part of speech of each token has been identified, the tokens are lemmatized. This involves reducing the word to its base or root form based on its part of speech.
  3. Removing stop words: As with tokenization, stop words are removed from the lemmatized text.

Overall, tokenization and lemmatization are essential steps in natural language processing. They convert raw text into a format that can be processed by machine learning algorithms, and they reduce the dimensionality of the feature space, making natural language processing tasks more computationally efficient. A short sketch of both steps follows.
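To make this concrete, here is a minimal sketch of POS-aware tokenization and lemmatization with NLTK. It assumes the punkt, averaged_perceptron_tagger, and wordnet resources have been downloaded; the wordnet_pos helper and the sample sentence are purely illustrative and are not part of the project code.

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def wordnet_pos(treebank_tag):
    # Map Penn Treebank tags to the POS categories the WordNet lemmatizer expects
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    if treebank_tag.startswith("V"):
        return wordnet.VERB
    if treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
tokens = word_tokenize("The cats were running around the gardens".lower())
tagged = nltk.pos_tag(tokens)  # [(token, Penn Treebank tag), ...]
lemmas = [lemmatizer.lemmatize(tok, wordnet_pos(tag)) for tok, tag in tagged]
print(lemmas)  # roughly ['the', 'cat', 'be', 'run', 'around', 'the', 'garden']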

In [1]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split as surprise_train_test_split
from surprise.accuracy import rmse, mae
from surprise.model_selection import GridSearchCV
import multiprocessing

We will use the NLTK library to perform the text preprocessing. The following function takes a text string as input, removes links, user mentions, and special characters, drops stop words, and optionally stems the remaining tokens (it relies on the stop_words, stemmer, and TEXT_CLEANING_RE objects defined a few cells below):

In [7]:
def preprocess(text, stem=False):
    # Lowercase, then remove links, user mentions, and special characters
    text = re.sub(TEXT_CLEANING_RE, ' ', str(text).lower()).strip()
    tokens = []
    for token in text.split():
        if token not in stop_words:
            # Optionally reduce each remaining token to its stem
            if stem:
                tokens.append(stemmer.stem(token))
            else:
                tokens.append(token)
    return " ".join(tokens)
In [2]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ronal\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ronal\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ronal\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\ronal\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
Out[2]:
True
In [3]:
# Load the data
df = pd.read_csv("./data/training.1600000.processed.noemoticon.csv",names=['target','ids','date','flag','user','text']
                 ,encoding="ISO-8859-1"
                 ,dtype={'target':'int','ids':'int','date':'string','flag':'string','user':'string','text':'string'}
                )
display(df.head())
display(df.info())
target ids date flag user text
0 0 1467810369 Mon Apr 06 22:19:45 PDT 2009 NO_QUERY _TheSpecialOne_ @switchfoot http://twitpic.com/2y1zl - Awww, t...
1 0 1467810672 Mon Apr 06 22:19:49 PDT 2009 NO_QUERY scotthamilton is upset that he can't update his Facebook by ...
2 0 1467810917 Mon Apr 06 22:19:53 PDT 2009 NO_QUERY mattycus @Kenichan I dived many times for the ball. Man...
3 0 1467811184 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY ElleCTF my whole body feels itchy and like its on fire
4 0 1467811193 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY Karoli @nationwideclass no, it's not behaving at all....
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   target  1600000 non-null  int32 
 1   ids     1600000 non-null  int32 
 2   date    1600000 non-null  string
 3   flag    1600000 non-null  string
 4   user    1600000 non-null  string
 5   text    1600000 non-null  string
dtypes: int32(2), string(4)
memory usage: 61.0 MB
None
In [4]:
decode_map = {0: "NEGATIVE", 2: "NEUTRAL", 4: "POSITIVE"}
def decode_sentiment(label):
    return decode_map[int(label)]
In [5]:
%%time
df.target = df.target.apply(lambda x: decode_sentiment(x))
CPU times: total: 406 ms
Wall time: 485 ms
In [6]:
stop_words = stopwords.words("english")
stemmer = SnowballStemmer("english")
TEXT_CLEANING_RE = r"@\S+|https?:\S+|http?:\S|[^A-Za-z0-9]+"
In [8]:
%%time
df["clean_text"] = df["text"].apply(lambda x: preprocess(x))
CPU times: total: 44.1 s
Wall time: 45 s
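As a quick sanity check, here is what the cleaning step does to a single made-up tweet. This is just an illustration that assumes the preprocess function, stop_words, and stemmer defined above; the exact output depends on NLTK's stop word list.

sample = "@some_user http://example.com This is SO disappointing!!!"
print(preprocess(sample))             # 'disappointing' (mention, URL, and stop words removed)
print(preprocess(sample, stem=True))  # 'disappoint' (Snowball stem of the remaining token)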

Step 2: Feature Extraction¶

Once we have preprocessed the text data, the next step is to extract features from it. We will use the TF-IDF algorithm to extract features from the preprocessed text data. TF-IDF stands for term frequency-inverse document frequency and is a numerical statistic that reflects how important a word is to a document in a collection or corpus.

We will use the TfidfVectorizer() class from the scikit-learn library to perform the feature extraction. The following code splits the data into training and test sets, initializes the TfidfVectorizer() with some common parameters, and fits it on the training text:

In [9]:
%%time

# Train a sentiment analysis model
X_train, X_test, y_train, y_test = train_test_split(df["clean_text"], df["target"], test_size=0.2, random_state=42)

tfidf_vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 3), stop_words="english", use_idf=True, smooth_idf=True, sublinear_tf=True)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
CPU times: total: 1min 4s
Wall time: 1min 4s
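Before training a classifier, it is worth peeking at what the vectorizer actually learned. A small sketch, assuming the tfidf_vectorizer, X_train_tfidf, and X_test_tfidf fitted above (get_feature_names_out() is available in scikit-learn 1.0+; older versions use get_feature_names()):

print(X_train_tfidf.shape)  # (number of training tweets, 10000 TF-IDF features)
print(X_test_tfidf.shape)

vocab = tfidf_vectorizer.get_feature_names_out()
print(vocab[:10])           # a few of the unigrams/bigrams/trigrams kept by max_features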

Step 3: Sentiment Analysis¶

With the features extracted, we can now train a machine learning model to perform the sentiment analysis. We will use a support vector machine (SVM) classifier with a linear kernel. SVMs are a popular choice for text classification because they handle high-dimensional, sparse data well, and a linear kernel is usually a strong baseline for TF-IDF features.

We will use the SVC() class from the scikit-learn library with a linear kernel to train the classifier. The following code initializes the model with some common parameters, fits it to the TF-IDF features and the target labels, and evaluates the predictions:

In [12]:
%%time 

svm_classifier = SVC(kernel="linear", max_iter=1000, tol=0.01)

#We can then fit the SVM classifier to the TF-IDF features and the target column:
svm_classifier.fit(X_train_tfidf, y_train)

y_pred = svm_classifier.predict(X_test_tfidf)

accuracy = accuracy_score(y_test, y_pred)
print("Sentiment analysis accuracy:", accuracy)

report = classification_report(y_test, y_pred)
print("Sentiment analysis classification report:\n", report)

conf_matrix = confusion_matrix(y_test, y_pred)
print("Sentiment analysis confusion matrix:\n", conf_matrix)
c:\Users\ronal\anaconda3\lib\site-packages\sklearn\svm\_base.py:299: ConvergenceWarning: Solver terminated early (max_iter=1000).  Consider pre-processing your data with StandardScaler or MinMaxScaler.
  warnings.warn(
Sentiment analysis accuracy: 0.530746875
Sentiment analysis classification report:
               precision    recall  f1-score   support

    NEGATIVE       0.69      0.11      0.19    159494
    POSITIVE       0.52      0.95      0.67    160506

    accuracy                           0.53    320000
   macro avg       0.60      0.53      0.43    320000
weighted avg       0.60      0.53      0.43    320000

Sentiment analysis confusion matrix:
 [[ 17049 142445]
 [  7716 152790]]
CPU times: total: 2min 20s
Wall time: 2min 22s
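Note the ConvergenceWarning above: a kernel SVC scales poorly to 1.28 million sparse TF-IDF rows, and capping max_iter at 1000 stops training well before convergence, which largely explains the modest ~53% accuracy. A common alternative for large, sparse, high-dimensional text data is scikit-learn's LinearSVC (or SGDClassifier). The following is a minimal sketch that reuses the TF-IDF matrices from Step 2; it is not the model whose results are shown above.

from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

linear_svm = LinearSVC(C=1.0, max_iter=2000)
linear_svm.fit(X_train_tfidf, y_train)

y_pred_linear = linear_svm.predict(X_test_tfidf)
print("LinearSVC accuracy:", accuracy_score(y_test, y_pred_linear))
print(classification_report(y_test, y_pred_linear))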

Step 4: Model Selection and Evaluation¶

To ensure that our models are performing well, we need to evaluate them with appropriate metrics. For the sentiment classifier we have already looked at accuracy, the classification report, and the confusion matrix; cross-validation gives a more robust estimate of its accuracy. For the recommendation model, we will use grid search with cross-validation to find good hyperparameters and evaluate the tuned model with RMSE and MAE.

We can use the cross_val_score() function from the scikit-learn library to cross-validate the sentiment classifier, and GridSearchCV from the Surprise library to tune the SVD recommendation model:
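Here is a minimal sketch of cross-validating the sentiment classifier on a subsample of the TF-IDF features. The subset size is arbitrary and purely illustrative, and LinearSVC is used because cross-validating a kernel SVC on all 1.28 million training rows would be prohibitively slow.

from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

subset = 50000  # illustrative subsample size
scores = cross_val_score(LinearSVC(max_iter=2000),
                         X_train_tfidf[:subset], y_train.iloc[:subset],
                         cv=5, scoring="accuracy")
print("Cross-validated accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))

With the classifier checked, we turn to the recommendation model: the sentiment labels are mapped to numeric ratings and fed to the Surprise library.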

In [17]:
target_map = {'NEGATIVE': 0, 'POSITIVE': 1}
df['target'] = df['target'].map(target_map)
In [18]:
# Train a product recommendation model
reader = Reader(rating_scale=(0, 4))
data = Dataset.load_from_df(df[["user", "ids", "target"]], reader)
trainset, testset = surprise_train_test_split(data, test_size=0.2, random_state=42)

param_grid = {"n_epochs": [10, 20], "lr_all": [0.002, 0.005], "reg_all": [0.4, 0.6]}
grid_search = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=3)
grid_search.fit(data)

best_rmse = grid_search.best_score["rmse"]
best_mae = grid_search.best_score["mae"]
best_params = grid_search.best_params["rmse"]

print("Product recommendation RMSE:", best_rmse)
print("Product recommendation MAE:", best_mae)
print("Product recommendation best parameters:", best_params)

algo = SVD(n_epochs=best_params["n_epochs"], lr_all=best_params["lr_all"], reg_all=best_params["reg_all"], verbose=False)
algo.fit(trainset)
predictions = algo.test(testset)

rmse_score = rmse(predictions)
mae_score = mae(predictions)

print("Product recommendation RMSE:", rmse_score)
print("Product recommendation MAE:", mae_score)

# Combine sentiment analysis and product recommendation
df["predicted_sentiment"] = svm_classifier.predict(tfidf_vectorizer.transform(df["clean_text"]))
df["predicted_rating"] = df.apply(lambda row: algo.predict(row["user"], row["ids"]).est, axis=1)

# Save the results to a new CSV file
df.to_csv("customer_feedback_with_predictions.csv", index=False)
Product recommendation RMSE: 0.4772411223324102
Product recommendation MAE: 0.468828564792572
Product recommendation best parameters: {'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.4}
RMSE: 0.4753
MAE:  0.4660
Product recommendation RMSE: 0.47527843956146315
Product recommendation MAE: 0.4659663620618836

Make personalized recommendations for a user¶

In [19]:
def recommend(user):
    # Get the list of item IDs the user has already rated
    rated_ids = list(df.loc[df["user"] == user, "ids"])
    # Get the list of all item IDs
    all_ids = list(df["ids"].unique())
    # Remove the already rated items from the list of all items
    item_ids = [id for id in all_ids if id not in rated_ids]
    # Create a list of (item ID, predicted rating) tuples for the user
    item_ratings = [(id, algo.predict(user, id).est) for id in item_ids]
    # Sort the items by predicted rating (descending order)
    item_ratings.sort(key=lambda x: x[1], reverse=True)
    # Return the top 5 recommended items
    top_items = item_ratings[:5]
    return top_items

Test the recommendation function¶

In [20]:
user = "john123"
recommended_items = recommend(user)
print("Recommended items for user", user, ":")
for item in recommended_items:
    print("- Item ID:", item[0], "- Predicted rating:", item[1])
Recommended items for user john123 :
- Item ID: 1966876510 - Predicted rating: 0.58289153336724
- Item ID: 1693433586 - Predicted rating: 0.5789923239998461
- Item ID: 1957741658 - Predicted rating: 0.5787092243641766
- Item ID: 2055182704 - Predicted rating: 0.5786675266045753
- Item ID: 1881797143 - Predicted rating: 0.5784505736784116

Conclusion¶

In this article, we have walked through a complete data science project for sentiment analysis and product recommendation using Python. We have covered key skills and technologies, including text preprocessing, feature extraction, machine learning, and model selection and evaluation.

By showcasing this project in a data science portfolio, a data scientist can demonstrate their proficiency in these key areas to potential employers. With the rise of unstructured text data and the need to make sense of it, data scientists who can perform sentiment analysis and build product recommendation engines will be in high demand in the years to come.
